feat!: support ray distributed IVF_SQ/PQ/FLAT index builder #67
chenghao-guo merged 4 commits into lance-format:main
Testing Environment
We conducted tests using the
Testing Results
IVF_SQ Group (Tested on S3)
IVF_FLAT Group
IVF_PQ Group
Overall Observation
Because the index is built against S3, network speed and S3 I/O may vary during the process. A build can sometimes take hours, mainly for the reasons below. The distributed index was built using 4-core, 16GB machines (4 workers in total). We also checked whether performance scales linearly. The main reasons for the performance acceleration are:
Conclusion
The test was carried out with a 4-core, 16GB machine and an S3 object store. The results clearly show that the distributed setup significantly outperforms the single-machine setup in terms of index building time, especially for the
We can estimate speedup as:
Limitation
Since PQ (Product Quantization) depends on:
both of which are carried out on a single machine, the acceleration of IVF_PQ may be unsatisfactory when training the PQ codebook takes a long time. However, it is more suitable for a larger number of tables. The most well-balanced method appears to be IVF_SQ (Inverted File with Scalar Quantization), as it can generally guarantee good recall and supports distributed parallelism.
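The limitation above can be illustrated with an Amdahl's-law style estimate. This is a minimal sketch and not part of the PR: the function name `estimated_speedup` is hypothetical, and it assumes the single-machine phase (for example, PQ codebook training) takes the same time regardless of worker count, while the remaining work divides evenly across workers.

```python
def estimated_speedup(serial_seconds: float, parallel_seconds: float, workers: int) -> float:
    """Amdahl's-law estimate of distributed speedup (illustrative assumption).

    serial_seconds:   time spent in phases that run on a single machine
                      (e.g. codebook training); assumed not to shrink.
    parallel_seconds: time spent in phases that can be split across workers;
                      assumed to divide evenly (ideal scaling).
    workers:          number of workers, e.g. 4 as in the test setup.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    single_machine = serial_seconds + parallel_seconds
    distributed = serial_seconds + parallel_seconds / workers
    return single_machine / distributed


# Made-up timings for illustration only; these are not measured results.
fully_parallel = estimated_speedup(serial_seconds=0, parallel_seconds=100, workers=4)
half_serial = estimated_speedup(serial_seconds=50, parallel_seconds=50, workers=4)
```

Under these assumptions, a build with no serial phase approaches the ideal factor equal to the worker count, while a build that spends half its time in single-machine training gains far less. Real builds on S3 will deviate because of network and I/O variance.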
jackye1995 left a comment
Sorry for making some conflicting changes. I rebased and this looks good to me.
@chenghao-guo can you make sure we add this to the documentation? Another thing missing is we probably should propagate GPU configs like
Hi Jack, thanks a lot for the review. I’ll update the documentation accordingly in the md file.
close #66
Depends on this PR: lance-format/lance#5117
The new create_index function orchestrates a multi-phase workflow: